Multi-layer Neural Network

By virtue of being here, it is assumed that you have gone through the Quick Start. To recap the Quick Start tutorial, we imported the MNIST dataset and trained a logistic regression classifier, which produces a linear classification boundary. It is impossible to learn complex functions like XOR with a linear classification boundary.

A Neural Network is a function approximator consisting of several neurons organized in a layered fashion. Each neuron takes input from the previous layer, performs some mathematical calculation and sends its output to the next layer. A neuron produces an output only if the result of the calculation it performs is greater than some threshold. This threshold function is called the activation function. Depending on the type of task, different activation functions can be used. Some of the most commonly used activation functions are sigmoid, tanh, ReLU and maxout. This is inspired by the functioning of the human brain, where one neuron sends a signal to another neuron only if the electrical signal in the first neuron is greater than some threshold.
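As a quick illustration (plain numpy, not the YANN API), the most common activation functions mentioned above can be written as follows; maxout is omitted since it operates over groups of units rather than a single value:

import numpy as np

def sigmoid(x):
    # squashes its input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes its input into the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # passes positive inputs through, zeroes out the rest
    return np.maximum(0.0, x)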

A feed forward neural network (multi-layer perceptron) has an input layer, an output layer and some hidden layers. The actual magic of the neural network happens in the hidden layers, which represent the function the network is trying to approximate. The output layer is generally a softmax function that converts its inputs into probabilities. Let us look at the mathematical representation of the hidden layer and the output layer.

Hidden layer:

Let $[a_{i-1}^1, a_{i-1}^2, a_{i-1}^3, \ldots, a_{i-1}^n]$ be the activations of the previous layer $i-1$: $$h_i = w_i^0 + w_i^1a_{i-1}^1 + w_i^2a_{i-1}^2 + \ldots + w_i^na_{i-1}^n$$ $$a_i = act(h_i)$$ where $i$ is the layer number, $[w_i^1, w_i^2, w_i^3, \ldots, w_i^n]$ are the parameters between the $(i-1)^{th}$ and $i^{th}$ layers, $w_i^0$ is the bias (the input when there is no activation from the previous layer), $1, 2, \ldots, n$ index the dimensions of the layer, $a_i$ is the activation of layer $i$, and $act()$ is the activation function for that layer.
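As a minimal numpy sketch of the two equations above (illustrative only; YANN builds these expressions internally), with hypothetical names W, b and a_prev standing for $w_i$, $w_i^0$ and the previous layer's activations:

import numpy as np

n_prev, n_curr = 4, 3                        # sizes of layers i-1 and i (arbitrary here)
a_prev = np.random.rand(n_prev)              # activations a_{i-1} of the previous layer
W = 0.01 * np.random.randn(n_curr, n_prev)   # weights w_i between layers i-1 and i
b = np.zeros(n_curr)                         # bias w_i^0

h = W.dot(a_prev) + b                        # weighted sum h_i
a = np.maximum(0.0, h)                       # a_i = act(h_i), using ReLU as act()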

Output layer:

Let our network have $l$ layers: $$z = w_l^0 + w_l^1a_{l-1}^1 + w_l^2a_{l-1}^2 + \ldots + w_l^na_{l-1}^n$$ $$a = softmax(z)$$ $$\text{correct class} = argmax(a)$$

where $a$ represents the output probabilities and $z$ represents the weighted sum of the activations of the previous layer.
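A small numpy sketch of the output layer (again illustrative, not YANN internals): softmax turns the weighted sum z into probabilities and argmax picks the predicted class.

import numpy as np

def softmax(z):
    # subtract the maximum for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.2, 0.3, -0.8])    # weighted activations from the last hidden layer
a = softmax(z)                    # output probabilities, sum to 1
predicted_class = np.argmax(a)    # index of the most probable class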

Neural Network training:

A neural network has a lot of parameters to learn. Consider a network with two hidden layers of 100 neurons each, an input dimension of 1024 and 10 outputs. The number of parameters to learn is then $1024 \times 100 + 100 \times 100 + 100 \times 10 = 113{,}400$ weights (plus one bias per neuron). Learning this many parameters is a complex task, because for each parameter we need to calculate the gradient of the error function and update the parameter with that gradient. The computational cost of doing this naively is one reason neural networks quickly lost their charm the first time around. A technique called back propagation solves this problem. The following section gives a brief insight into the backpropagation technique.
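A quick, illustrative Python check of that count, using the layer sizes assumed above:

layer_sizes = [1024, 100, 100, 10]          # input, two hidden layers, output

# each adjacent pair of layers contributes an (n_in x n_out) weight matrix
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
biases = sum(layer_sizes[1:])               # one bias per non-input neuron
print(weights, biases, weights + biases)    # 113400 210 113610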

Back Propagation:

YANN handles back propagation by itself, but it does not hurt to know how it works. A neural network can be represented mathematically as $$O = f_1(W_lf_2(W_{l-1}f_3(\ldots f_n(W_1X)\ldots)))$$ where $f_1, f_2, \ldots, f_n$ are activation functions. The error can be represented as $$E(f_1(W_lf_2(W_{l-1}f_3(\ldots f_n(W_1X)\ldots))))$$ where $E()$ is some error function. The gradient with respect to $W_l$ is given by:

$$g_l = \frac{\partial E(f_1(W_lf_2(W_{l-1}f_3(\ldots f_n(W_1X)\ldots))))}{\partial W_l}$$

Applying the chain rule: $$g_l = \frac{\partial E(f_1())}{\partial f_1}\frac{\partial f_1}{\partial W_l}$$ The gradient of the error w.r.t. $W_{l-1}$, after applying the chain rule again: $$g_{l-1} = \frac{\partial E(f_1())}{\partial f_1}\frac{\partial f_1(W_lf_2())}{\partial f_2}\frac{\partial f_2()}{\partial W_{l-1}}$$

In the above equations, the first term $\frac{\partial E(f_1())}{\partial f_1}$ remains the same for both gradients. Similarly, for the rest of the parameters we reuse terms from the previous gradient calculation, as in the sketch below. This process drastically reduces the number of calculations in neural network training.
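A hedged numpy sketch of this reuse for a tiny two-layer network with sigmoid activations and a squared error (all names here are illustrative, not YANN code): the term delta2 computed for the last layer is reused when computing the gradient of the earlier layer instead of being recomputed.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.random.rand(4)              # input
y = np.array([1.0, 0.0])           # target
W1 = 0.1 * np.random.randn(3, 4)   # first-layer weights
W2 = 0.1 * np.random.randn(2, 3)   # second-layer weights

# forward pass
a1 = sigmoid(W1.dot(x))
a2 = sigmoid(W2.dot(a1))

# backward pass: the upstream term is computed once ...
delta2 = (a2 - y) * a2 * (1 - a2)          # dE/dz2 for E = 0.5 * ||a2 - y||^2
g2 = np.outer(delta2, a1)                  # gradient w.r.t. W2

# ... and reused when moving one layer back
delta1 = W2.T.dot(delta2) * a1 * (1 - a1)  # dE/dz1, reuses delta2
g1 = np.outer(delta1, x)                   # gradient w.r.t. W1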

Let us take this one step further and create a neural network with two hidden layers. We begin as usual by importing the network class and creating the input layer.


In [2]:
from yann.network import network
from yann.special.datasets import cook_mnist

data = cook_mnist()
dataset_params  = { "dataset": data.dataset_location(), "id": 'mnist', "n_classes" : 10 }

net = network()
net.add_layer(type = "input", id ="input", dataset_init_args = dataset_params)


. Setting up dataset 
.. setting up skdata
... Importing mnist from skdata
.. setting up dataset
.. training data
.. validation data 
.. testing data 
. Dataset 24331 is created.
. Time taken is 0.936363 seconds
. Initializing the network
.. Adding input layer input

Instead of connecting this to a classifier as we saw in the Quick Start, let us add a couple of fully connected hidden layers. Hidden layers can be created using the layer type dot_product.


In [3]:
net.add_layer (type = "dot_product",
               origin ="input",
               id = "dot_product_1",
               num_neurons = 800,
               regularize = True,
               activation ='relu')

net.add_layer (type = "dot_product",
               origin ="dot_product_1",
               id = "dot_product_2",
               num_neurons = 800,
               regularize = True,
               activation ='relu')


.. Adding dot_product layer dot_product_1
.. Adding flatten layer 2
.. Adding dot_product layer dot_product_2

Notice the parameters passed. num_neurons is the number of nodes in the layer. Notice also how we modularized the layers by using the id parameter. origin specifies which layer feeds the new layer. By default, yann assumes layers are added serially and chooses the last added layer as the origin. Using origin, one can create various types of architectures; in fact, any directed acyclic graph (DAG) that could be hand-drawn could be implemented. A rough sketch of such branching follows.
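As a rough, untested sketch of what origin makes possible (illustrative only, not meant to be run as part of this tutorial's network): two parallel branches can read from the same layer simply by naming the same origin. Joining such branches back together would need additional layer types not covered in this tutorial.

net.add_layer(type = "dot_product",
              origin = "input",       # both branches originate from the same layer
              id = "branch_1",
              num_neurons = 400,
              activation = 'relu')

net.add_layer(type = "dot_product",
              origin = "input",
              id = "branch_2",
              num_neurons = 400,
              activation = 'relu')

With that illustration aside, let us now add a classifier and an objective layer to our network.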


In [4]:
net.add_layer ( type = "classifier",
                id = "softmax",
                origin = "dot_product_2",
                num_classes = 10,
                activation = 'softmax',
                )

net.add_layer ( type = "objective",
                id = "nll",
                origin = "softmax",
                )


.. Adding classifier layer softmax
.. Adding objective layer nll

The following block is something we did not use in the Quick Start tutorial: we are adding an optimizer and its parameters to the network. Let us create our own optimizer module this time instead of using the yann default. For any module in yann, initialization is done using the add_module method. The add_module method typically takes an input type, which in this case is optimizer, and a set of initialization parameters, which in our case is params = optimizer_params. Any module's params, here optimizer_params, is a dictionary of relevant options. If you are not familiar with optimizers in neural networks, we suggest going through the Optimizers to Neural network series of tutorials to get familiar with the effect of different optimizers on a neural network.

A typical optimizer setup is:


In [10]:
optimizer_params =  {
            "momentum_type"       : 'polyak',
            "momentum_params"     : (0.9, 0.95, 30),
            "regularization"      : (0.0001, 0.0002),
            "optimizer_type"      : 'rmsprop',
            "id"                  : 'polyak-rms'
                    }
net.add_module ( type = 'optimizer', params = optimizer_params )


.. Setting up the optimizer

We have now successfully added Polyak momentum with RMSProp back propagation, along with $L_1$ and $L_2$ regularization coefficients that will be applied to the layers for which we passed the argument regularize = True. For more options and parameters of the optimizer, refer to the optimizer documentation. This optimizer will therefore solve the following error:

$$e = \sigma(.) + 0.0001 \sum_i |w_i| + 0.0002 \sum_i \|w_i\|_2^2$$

where $e$ is the error, $\sigma(.)$ is the sigmoid layer and $w_i$ is the $i^{th}$ layer of the network.
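As a plain-Python illustration (not YANN internals) of how those regularization terms enter the error for a single regularized layer:

import numpy as np

l1_coeff, l2_coeff = 0.0001, 0.0002   # the regularization tuple passed to the optimizer above
w = np.random.randn(800, 10)          # a hypothetical weight matrix of a regularized layer

penalty = l1_coeff * np.sum(np.abs(w)) + l2_coeff * np.sum(w ** 2)
# total error = objective (negative log-likelihood here) + penalty, summed over regularized layers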


In [11]:
learning_rates = (0.05, 0.01, 0.001)



The learning_rates supplied here is a tuple. The first value indicates the annealing of a linear rate, the second is the initial learning rate of the first era, and the third is the learning rate of the second era. Accordingly, epochs takes a tuple with the number of epochs for each era.

Now we can cook, train and test as usual:


In [12]:
net.cook( optimizer = 'polyak-rms',
          objective_layer = 'nll',
          datastream = 'mnist',
          classifier = 'softmax',
          )

net.train( epochs = (20, 20),
           validate_after_epochs = 2,
           training_accuracy = True,
           learning_rates = learning_rates,
           show_progress = True,
           early_terminate = True)


.. Cooking the network
.. All checks complete, cooking continues
. Training
. 

.. Epoch: 0 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 97.37
.. Training accuracy : 98.304
.. Best training accuracy
.. Best validation accuracy
.. Cost                : 0.0788697
... Learning Rate       : 0.00999999977648
... Momentum            : 0.899999976158
. 

.. Epoch: 1 Era: 0
| training  100% Time: 0:00:00                                                 
/ training   17% ETA:  0:00:00                                                 
.. Cost                : 0.0589618
... Learning Rate       : 0.00949999969453
... Momentum            : 0.901666641235
. 

.. Epoch: 2 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 97.27
.. Training accuracy : 98.568
.. Best training accuracy
.. Cost                : 0.0492305
... Learning Rate       : 0.00902500003576
... Momentum            : 0.903333306313
. 

.. Epoch: 3 Era: 0
| training  100% Time: 0:00:00                                                 
/ training   17% ETA:  0:00:00                                                 
.. Cost                : 0.039751
... Learning Rate       : 0.00857375003397
... Momentum            : 0.90499997139
.. Patience ran out lowering learning rate.
. 

.. Epoch: 4 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.08
.. Training accuracy : 99.396
.. Best training accuracy
.. Best validation accuracy
.. Cost                : 0.0666344
... Learning Rate       : 0.000814506202005
... Momentum            : 0.906666636467
.. Patience ran out lowering learning rate.
. 

.. Epoch: 5 Era: 0
| training  100% Time: 0:00:00                                                 
/ training   17% ETA:  0:00:00                                                 
.. Cost                : 0.0196242
... Learning Rate       : 7.73780848249e-05
... Momentum            : 0.908333301544
.. Patience ran out lowering learning rate.
. 

.. Epoch: 6 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.1
.. Training accuracy : 99.4
.. Best training accuracy
.. Best validation accuracy
.. Cost                : 0.0192627
... Learning Rate       : 7.3509181675e-06
... Momentum            : 0.909999966621
.. Patience ran out lowering learning rate.
. 

.. Epoch: 7 Era: 0
| training  100% Time: 0:00:00                                                 
/ training   17% ETA:  0:00:00                                                 
.. Cost                : 0.0189514
... Learning Rate       : 6.98337203175e-07
... Momentum            : 0.911666631699
.. Patience ran out lowering learning rate.
. 

.. Epoch: 8 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.1
.. Training accuracy : 99.4
.. Cost                : 0.0189819
... Learning Rate       : 6.63420323121e-08
... Momentum            : 0.91333335638
.. Patience ran out lowering learning rate.
. 

.. Epoch: 9 Era: 0
| training  100% Time: 0:00:00                                                 
| training   16% ETA:  0:00:00                                                 
.. Cost                : 0.0189849
... Learning Rate       : 6.30249274991e-09
... Momentum            : 0.914999961853
.. Patience ran out lowering learning rate.
. 

.. Epoch: 10 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.1
.. Training accuracy : 99.4
.. Cost                : 0.0189851
... Learning Rate       : 5.98736782376e-10
... Momentum            : 0.91666662693
.. Patience ran out lowering learning rate.
. 

.. Epoch: 11 Era: 0
| training  100% Time: 0:00:00                                                 
\ training   15% ETA:  0:00:00                                                 
.. Cost                : 0.0189851
... Learning Rate       : 5.68799937706e-11
... Momentum            : 0.918333292007
.. Patience ran out lowering learning rate.
. 

.. Epoch: 12 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.1
.. Training accuracy : 99.4
.. Cost                : 0.0189851
... Learning Rate       : 5.4035994429e-12
... Momentum            : 0.919999957085
.. Patience ran out lowering learning rate.
. 

.. Epoch: 13 Era: 0
| training  100% Time: 0:00:00                                                 
/ training   17% ETA:  0:00:00                                                 
.. Cost                : 0.0189851
... Learning Rate       : 5.13341918886e-13
... Momentum            : 0.921666622162
.. Patience ran out lowering learning rate.
. 

.. Epoch: 14 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.1
.. Training accuracy : 99.4
.. Cost                : 0.0189851
... Learning Rate       : 4.87674848692e-14
... Momentum            : 0.923333287239
.. Patience ran out lowering learning rate.
. 

.. Epoch: 15 Era: 0
| training  100% Time: 0:00:00                                                 
/ training   17% ETA:  0:00:00                                                 
.. Cost                : 0.0189851
... Learning Rate       : 4.63291107951e-15
... Momentum            : 0.924999952316
.. Patience ran out lowering learning rate.
. 

.. Epoch: 16 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.1
.. Training accuracy : 99.4
.. Cost                : 0.0189851
... Learning Rate       : 4.40126525025e-16
... Momentum            : 0.926666617393
.. Patience ran out lowering learning rate.
. 

.. Epoch: 17 Era: 0
| training  100% Time: 0:00:00                                                 
/ training   17% ETA:  0:00:00                                                 
.. Cost                : 0.0189851
... Learning Rate       : 4.18120178921e-17
... Momentum            : 0.928333282471
.. Patience ran out lowering learning rate.
. 

.. Epoch: 18 Era: 0
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.1
.. Training accuracy : 99.4
.. Cost                : 0.0189851
... Learning Rate       : 3.97214198099e-18
... Momentum            : 0.929999947548
.. Patience ran out lowering learning rate.
. 

.. Epoch: 19 Era: 0
| training  100% Time: 0:00:00                                                 
/ training   17% ETA:  0:00:00                                                 
.. Cost                : 0.0189851
... Learning Rate       : 3.77353462345e-19
... Momentum            : 0.931666612625
.. Patience ran out lowering learning rate.
.. Learning rate was already lower than specified. Not changing it.
.. Old learning rate was :3.5848579569e-20
.. Was trying to change to: 0.001
. 

.. Epoch: 20 Era: 1
| training  100% Time: 0:00:00                                                 
| validation  100% Time: 0:00:00                                               
.. Validation accuracy : 98.1
.. Training accuracy : 99.4
.. Cost                : 0.0189851
... Learning Rate       : 3.5848579569e-20
... Momentum            : 0.933333277702
.. Early stopping
.. Training complete.Took 0.79049325 minutes

This time, instead of letting it run all forty epochs, you can cancel in the middle after some epochs by hitting ^c. Once it stops, let us immediately test and demonstrate that the net retains the parameters as updated as possible. Some new arguments are introduced here, and they are for the most part easy to understand in context. epochs is a tuple giving the number of epochs of training and the number of fine-tuning epochs after that; there could be several of these stages of finer tuning. Yann uses the term 'era' to represent each set of epochs running with one learning rate. show_progress will print a progress bar for each epoch. validate_after_epochs will perform validation after that many epochs on a separate validation dataset.

Once done, let's run net.test():


In [13]:
net.test()


.. Testing
| testing  100% Time: 0:00:00                                                  
.. Testing accuracy : 97.86

The full code for this tutorial, with additional commentary, can be found in the file pantry.tutorials.mlp.py. If you have the toolbox cloned or downloaded, or just the tutorials downloaded, run the code as:


In [16]:
# assumes the toolbox root is on your Python path (see below)
from pantry.tutorials.mlp import mlp
mlp(dataset = data.dataset_location())

or simply,

python pantry/tutorials/mlp.py 

from the toolbox root, or with the toolbox added to your path. The __init__ program has all the required tools to create or load an already created dataset. Optionally, you can provide the location of the dataset as a command line argument.

